Abstract

In survey statistics, the usual technique for estimating a population total consists of summing appropriately weighted variable values for the units in the sample. Different weighting systems exist: sampling weights, GREG weights or calibration weights, for example.

In this article, we propose to use the inverse of conditional inclusion probabilities as a weighting system. We study examples where auxiliary information makes it possible to perform an a posteriori stratification of the population. We show that, in these cases, exact computations of the conditional weights are possible.

When the auxiliary information consists of the knowledge of a quantitative variable for all the units of the population, we show that the conditional weights can be estimated via Monte Carlo simulations. This method is applied to outlier and strata-jumper adjustments.

Keywords: Auxiliary information; Conditional inference; Finite population; Inclu- 
sion probabilities; Monte Carlo methods; Sampling weights 

1 Introduction 

The purpose of this article is to make systematic use of auxiliary information at the estimation phase by means of Monte Carlo methods, in a design-based approach.

In survey sampling, we often face a situation where information about the population (auxiliary information) becomes available only at the estimation phase. For example, this information can be provided by an administrative file available only after the collection stage. Another example would be the number of respondents to a survey. It is classical to deal with the non-response mechanism by a second sampling phase (often Poisson sampling conditional on the size of the sample). The size of the respondent sample is known only after the collection.



This information can be compared to its counterpart estimated by means of the sample. A significant difference typically reveals an unbalanced sample. In order to take this discrepancy into account, it is necessary to re-evaluate our estimates. In practice, two main techniques exist: the model-assisted approach (ratio estimator, post-stratification estimator, regression estimator) and the calibration approach. The conditional approach we will develop in this article has so far been mainly a theoretical concept because it involves rather complex computations of the inclusion probabilities. The use of Monte Carlo methods is a novelty that could enable the use of the conditional approach in practice. In particular, it seems to be very helpful for the treatment of outliers and strata jumpers.

Conditional inference in survey sampling means that, at the estimation phase, the sample selection is modelled by means of a conditional probability. Hence, the expectation and variance of the estimators are computed according to this conditional sampling probability. Moreover, we are thus provided with conditional sampling weights with better properties than the original sampling weights, in the sense that they lead to a better balanced sample (or calibrated sample).

Conditional inference is not a new topic and several authors have studied the conditional expectation and variance of estimators, among them Rao (1985), Robinson (1987), Tillé (1998, 1999) and Andersson (2004). Moreover, the problem of conditional inference is close to inference in the context of a rejective sampling design. The difference is that in rejective sampling, the conditioning event is controlled by the design, whereas, in conditional inference, the realization of the event is observed.

In section 2, the classical framework of finite population sampling and some nota- 
tions are presented. 

In section 3, we discuss the well-known setting of simple random sampling where we condition on the sizes of the sub-samples in the strata (a posteriori stratification). This leads to an alternative to the classical HT estimator. While a large part of the literature deals with the correction of conditional bias, we will directly use the concept of conditional HT estimator (Tillé, 1998), which seems more natural under conditional inference. A simulation study will be performed in order to compare the accuracy of the conditional strategy to the traditional one.

In section 4, the sampling design is a Poisson sampling conditional on the sample size n (also called conditional Poisson sampling of size n). We again use the information about the sub-sample sizes to condition on. We show that the conditional probability corresponds exactly to a stratified conditional Poisson sampling and we give a recursive formula that enables the calculation of the conditional inclusion probabilities. These results are new.

In section 5, we use a new conditioning statistic. Following Tillé (1998, 1999), we use the non-conditional HT estimation of the mean of the auxiliary variable to condition on. Whereas Tillé uses asymptotic arguments in order to approximate the conditional inclusion probabilities, we prefer to perform Monte Carlo simulations to address a non-asymptotic setting. Note that this idea of using independent replications of the sampling scheme in order to estimate inclusion probabilities when the sampling design is complex has already been proposed by Fattorini (2006) and Thompson and Wu (2008).

In section 6, we apply this method to practical examples: outliers and strata jumpers in business surveys. This new method to deal with outliers gives good results.

2 The context 

Let $U$ be a finite population of size $N$. The statistical units of the population are indexed by a label $k \in \{1, \ldots, N\}$. A random sample without replacement $s$ is selected using a probability (sampling design) $p(.)$. $\mathcal{S}$ is the set of the possible samples $s$. $I_{[k \in s]}$ is the indicator variable which is equal to one when the unit $k$ is in the sample and zero otherwise. The size of the sample is $n(s) = |s|$. Let $B_k = \{s \in \mathcal{S}, k \in s\} = \{s \in \mathcal{S}, I_{[k \in s]} = 1\}$ be the set of samples that contain $k$. For a fixed individual $k$, let $\pi_k = p(B_k)$ be the inclusion probability and let $d_k = 1/\pi_k$ be its sampling weight. For any variable $z$ that takes the value $z_k$ on the $U$-unit $k$, the sum $t_z = \sum_{k \in U} z_k$ is referred to as the total of $z$ over $U$. $\hat{t}_{z,\pi} = \sum_{k \in s} z_k / \pi_k$ is the Horvitz-Thompson estimator of the total $t_z$.

Let $x$ be an auxiliary variable that takes the value $x_k$ for the individual $k$. The $x_k$ are assumed to be known for all the units of $U$. Such auxiliary information is often used at the sampling stage in order to improve the sampling design. For example, if the auxiliary variable is categorical, then the sampling can be stratified. If the auxiliary variable is quantitative, looking for a sample balanced on the total of $x$ is a natural idea. These methods reduce the size of the initial set of admissible samples. In the second example, $\mathcal{S}_{balanced} = \{s \in \mathcal{S}, \hat{t}_{x,\pi} = t_x\}$.

We wish to use auxiliary information after the sample selection, that is, to take advantage of information such as the number of units sampled in each stratum or the estimate of the total $t_x$ given by the Horvitz-Thompson estimator. Let us take an example where the sample consists of 20 men and 80 women, drawn by simple random sampling of size $n = 100$ from a total population of $N = 200$ with equal inclusion probabilities $\pi_k = 0.5$. Let us assume that we are given a posteriori the additional information that the population has 100 men and 100 women. Then it is hard to maintain that the inclusion probability for both men and women was actually 0.5. It seems more sensible to consider that the sampled men had in fact an inclusion probability of $20/100 = 0.2$ and a weight of 5. Conditional inference aims at giving theoretical support to such intuitions.

We use the notation $\Phi(s)$ for the statistic that will be used in the conditioning. $\Phi(s)$ is a random vector that takes values in $\mathbb{R}^q$. In fact, $\Phi(s)$ will often be a discrete random vector which takes values in $\{1, \ldots, n\}^q$. To each possible subset $\varphi \subset \Phi(\mathcal{S})$ corresponds an event $A_\varphi = \Phi^{-1}(\varphi) = \{s \in \mathcal{S}, \Phi(s) \in \varphi\}$.

For example, if the auxiliary variable $x_k$ is the indicator function of a domain, say $x_k = 1$ if the unit $k$ is a man, then we can choose $\Phi(s) = \sum_{k \in s} I_{[k \in domain]} = n_{domain}$, the sample size in the domain (number of men in the sample). If the auxiliary variable is a quantitative variable, then we can choose $\Phi(s) = \sum_{k \in s} \frac{x_k}{\pi_k} = \hat{t}_{x,\pi}$, the Horvitz-Thompson estimator of the total $t_x$.



3 A posteriori Simple Random Sampling Stratification

3.1 Classical Inference 

In this section, the sampling design is a simple random sampling without replacement (SRS) of fixed size $n$; $\mathcal{S}_{SRS} = \{s \in \mathcal{S}, n(s) = n\}$; $p(s) = 1/\binom{N}{n}$ and the inclusion probability of each individual $k$ is $\pi_k = n/N$. Let $y$ be the variable of study; $y$ takes the value $y_k$ for the individual $k$. The $y_k$ are observed for all the units of the sample. The Horvitz-Thompson (HT) estimator of the total $t_y = \sum_{k \in U} y_k$ is $\hat{t}_{y,HT} = \sum_{k \in U} \frac{y_k}{\pi_k} I_{[k \in s]}$.

Assume now that the population $U$ is split into $H$ sub-populations $U_h$ called strata. Let $N_h = |U_h|$, $h \in \{1, \ldots, H\}$, be the auxiliary information to be taken into account. We split the sample $s$ into $H$ sub-samples defined by $s_h = s \cap U_h$. Let $n_h(s) = |s_h|$ be the size of the sub-sample $s_h$.

Ideally, it would be best to use the auxiliary information at the sampling stage. Here, a stratified simple random sampling (stratified SRS) with proportional allocation $N_h n/N$ would be more efficient than a SRS. For such a stratified SRS, the set of admissible samples is $\mathcal{S}_{SRS\ stratified} = \{s \in \mathcal{S}, \forall h \in [1,H], n_h(s) = N_h n/N\}$, and the sampling design is $p(s) = \prod_{h \in [1,H]} \binom{N_h}{N_h n/N}^{-1}$, $s \in \mathcal{S}_{SRS\ stratified}$. Once again, our point is precisely to consider settings where the auxiliary information becomes available only after the sampling stage.



3.2 Conditional Inference 

The a posteriori stratification with an initial SRS was described by Rao (1985) and Tillé (1998). A sample $s_0$ of size $n(s_0) = n$ is selected. We observe the sizes of the strata sub-samples: $n_h(s_0) = \sum_{k \in U_h} I_{[k \in s_0]}$, $h \in [1,H]$. We assume that $\forall h$, $n_h(s_0) > 0$. We then consider the event:

$$A_0 = \{s \in \mathcal{S}, \forall h \in [1,H], n_h(s) = n_h(s_0)\}.$$

It is clear that $s_0 \in A_0$, so $A_0$ is not empty.

We now consider the conditional probability $p^{A_0}(.) = p(./A_0)$, which will be used as far as inference is concerned. The conditional inclusion probabilities are denoted

$$\pi_k^{A_0} = p^{A_0}([I_{[k \in s]} = 1]) = E^{A_0}(I_{[k \in s]}) = p([I_{[k \in s]} = 1] \cap A_0) / p(A_0).$$

Accordingly, we define the conditional sampling weights: $d_k^{A_0} = 1/\pi_k^{A_0}$.

Proposition 1. 1. The conditional probability $p^{A_0}$ is the law of a stratified simple random sampling with allocation $(n_1(s_0), \ldots, n_H(s_0))$.

2. For a unit $k$ of stratum $h$: $\pi_k^{A_0} = \frac{n_h(s_0)}{N_h}$ and $d_k^{A_0} = \frac{N_h}{n_h(s_0)}$.

Proof. 1. $|A_0| = \binom{N_1}{n_1(s_0)} \times \cdots \times \binom{N_H}{n_H(s_0)}$.

$\forall s \in A_0$, $p^{A_0}(s) = 1/|A_0|$. So we have:

$$p^{A_0}(s) = \frac{I_{[s \in A_0]}}{\prod_{h \in [1,H]} \binom{N_h}{n_h(s_0)}} = \prod_{h \in [1,H]} \frac{I_{[n_h(s) = n_h(s_0)]}}{\binom{N_h}{n_h(s_0)}}$$

and we recognize the probability law of a stratified simple random sampling with allocation $(n_1(s_0), \ldots, n_H(s_0))$.

2. follows immediately. □
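Note that

$$E^{A_0}(\hat{t}_{y,HT}) = \sum_{k \in U} \frac{\pi_k^{A_0}}{\pi_k} y_k = \sum_{h \in [1,H]} \frac{N \, n_h(s_0)}{n \, N_h} \sum_{k \in U_h} y_k,$$

which differs from $t_y$ in general, so that the genuine HT estimator is conditionally biased in this framework.
Even if, as Tillé (1998) mentioned, it is possible to correct this bias simply by subtracting it from the HT estimator, it seems more coherent to use another linear estimator constructed like the HT estimator but, this time, using the conditional inclusion probabilities.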

Note that in practice $A_0$ should not be too small. The idea is that for any unit $k$, we should be able to find a sample $s$ such that $s \in A_0$ and $k \in s$. Thus, all the units of $U$ have a positive conditional inclusion probability.




Definition 1. The conditional HT estimator is defined as:

$$\hat{t}_{y,CHT} = \sum_{k \in U} \frac{y_k}{\pi_k^{A_0}} I_{[k \in s]}$$

The conditional Horvitz-Thompson (CHT) estimator is obviously conditionally unbiased and, therefore, unconditionally unbiased.

This estimator is in fact the classical post-stratification estimator obtained from a model-assisted approach (see Särndal et al. (1992) for example). However, conditional inference leads to a different derivation of the variance, which appears to be more reliable, as we will see in the next subsection.
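To make Proposition 1 and Definition 1 concrete, the following minimal Python sketch computes the conditional weights $N_h / n_h(s_0)$ and the resulting CHT estimate. It is an illustration only, not the code used for this article: the function names `conditional_weights` and `cht_total` and the integer stratum labels are assumptions introduced here.

```python
import numpy as np

def conditional_weights(strata_sample, N_h):
    """Conditional weights d_k = N_h / n_h(s0) of the sampled units.

    strata_sample: stratum label (0..H-1) of each sampled unit (hypothetical encoding).
    N_h: population stratum sizes, assumed known a posteriori.
    """
    strata_sample = np.asarray(strata_sample)
    N_h = np.asarray(N_h, dtype=float)
    n_h = np.bincount(strata_sample, minlength=len(N_h))
    if np.any(n_h == 0):
        raise ValueError("every stratum needs at least one sampled unit")
    return N_h[strata_sample] / n_h[strata_sample]

def cht_total(y_sample, strata_sample, N_h):
    """Conditional Horvitz-Thompson estimate of the total of y."""
    d = conditional_weights(strata_sample, N_h)
    return float(np.sum(d * np.asarray(y_sample, dtype=float)))

# Example of Section 2: 20 men (label 0) and 80 women (label 1) are sampled
# from a population of 100 men and 100 women.
labels = np.array([0] * 20 + [1] * 80)
d = conditional_weights(labels, [100, 100])
# Men receive the weight 100/20 = 5 and women 100/80 = 1.25.
```

With these weights the estimated stratum sizes reproduce the true ones (20 x 5 = 100 and 80 x 1.25 = 100), which is the calibration property discussed later in this section.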



3.3 Simulations 

In this part, we will compare the point estimates of a total according to two strategies: (SRS design + conditional (post-stratification) estimator) and (SRS design + HT estimator).

The population size is $N = 500$; the variable $y$ is a quantitative variable drawn from a uniform distribution over the interval $[0, 4000]$. The population is divided into 4 strata corresponding to the values of $y_k$ (if $y_k \in [0, 1000[$ then $k$ belongs to stratum 1, and so on). The auxiliary information will be the size of each stratum in the population. In this example, we get $N_1 = 123$, $N_2 = 123$, $N_3 = 132$ and $N_4 = 122$.

Figure 1: Point estimation. Comparison between $\hat{\mu}_{y,CHT}$ and $\hat{\mu}_{y,HT}$ (horizontal axis: $\hat{\mu}_{y,HT}$).

The finite population stays fixed and we simulate, with the software R, $K = 10^3$ simple random samples of size $n = 100$. Two estimators of the mean $\mu_y = \frac{1}{N} \sum_{k \in U} y_k$ are computed and compared. The first one is the HT estimator, $\hat{\mu}_{y,HT} = \frac{1}{n} \sum_{k \in s} y_k$, and the second one is the conditional estimator, $\hat{\mu}_{y,CHT} = \frac{1}{N} \sum_h \sum_{k \in U_h} \frac{N_h}{n_h(s)} y_k I_{[k \in s]}$.

In Figure 1, we can see the values of $\hat{\mu}_{y,HT}$ and $\hat{\mu}_{y,CHT}$ for each of the $10^3$ simulations. The red dots are those for which the conditional estimate is closer to the true value $\mu_y = 2019.01$ than the unconditional estimate; red dots represent 83.5% of the simulations. Moreover, the empirical variance of the conditional estimator is clearly smaller than the empirical variance of the unconditional estimator.

This is fully consistent with the results obtained for the post-stratification estimator in a model-assisted approach (see Särndal et al. (1992) for example). However, what is new and fundamental in the conditional approach is to understand that, for one fixed sample, the conditional bias and variance are much more reliable than the unconditional bias and variance. The theoretical study of conditional variance estimation is a subject still to be developed.
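The simulation design of this subsection can be sketched in Python as follows. This is a hedged re-creation under stated assumptions, not the original R code: the seed and the generated population are arbitrary, so the stratum sizes and the share of favourable runs will differ from the figures reported above.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, not taken from the article

N, n, K = 500, 100, 1000
y = rng.uniform(0, 4000, size=N)                  # fixed finite population
strata = np.minimum((y // 1000).astype(int), 3)   # 4 strata defined by the value of y
N_h = np.bincount(strata, minlength=4)            # auxiliary information: stratum sizes
mu_y = y.mean()

mu_ht = np.empty(K)
mu_cht = np.empty(K)
for i in range(K):
    s = rng.choice(N, size=n, replace=False)      # SRS without replacement
    n_h = np.bincount(strata[s], minlength=4)     # sub-sample sizes n_h(s)
    mu_ht[i] = y[s].mean()                        # HT estimator of the mean under SRS
    # Conditional (post-stratified) estimator with weights N_h / n_h(s);
    # the sketch assumes every stratum is hit, which is almost certain here.
    mu_cht[i] = np.sum(N_h[strata[s]] / n_h[strata[s]] * y[s]) / N

closer = np.mean(np.abs(mu_cht - mu_y) < np.abs(mu_ht - mu_y))
print(f"share of runs where the conditional estimate is closer: {closer:.3f}")
print(f"empirical variances: HT = {mu_ht.var():.1f}, CHT = {mu_cht.var():.1f}")
```

Because the strata are defined from $y$ itself, most of the variability of $y$ lies between strata, which is why the conditional estimator is expected to have a much smaller empirical variance in this setting.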



3.4 Discussion 

1. The traditional sampling strategy is defined as a couple (sampling design + 
estimator). We propose to define here the strategy as a triplet (sampling design 
+ conditional sampling probability + estimator). 

2. We have conditioned on the event $A_0 = \{s \in \mathcal{S}, \forall h \in [1,H], n_h(s) = n_h(s_0)\}$. Under a SRS, this is equivalent to using the HT estimators of the sizes of the strata in the conditioning, that is, to using $\Phi(s) = (\hat{N}_1(s), \ldots, \hat{N}_H(s))'$ where $\hat{N}_h(s) = \sum_{k \in U_h} \frac{I_{[k \in s]}}{\pi_k} = \frac{N}{n} n_h(s)$. Then, $A_0 = \{s \in \mathcal{S}, \Phi(s) = \Phi(s_0)\}$. We will see in Section 5 the importance of this remark.

3. The CHT estimations of the sizes of the strata are equal to the true strata 
sizes Nh, which means that the CHT estimations, in this setting, have the cal- 
ibration property for the auxiliary information of the size of the strata. Hence, 
conditional inference gives a theoretical framework for the current practice of 
calibration on auxiliary variables. 



4 A Posteriori Conditional Poisson Stratification 

Rao (1985), Tillé (1999) and Andersson (2005) mentioned that a posteriori stratification in a more complex setting than an initial SRS is not a trivial task, and that one must rely on approximate procedures. In this section, we show that it is possible to determine the conditional sampling design and to compute exactly the conditional inclusion probabilities for an a posteriori stratification with a conditional Poisson sampling of size n.



4.1 Conditional Inference 

Let $\tilde{p}(s) = \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k)$ be a Poisson sampling with inclusion probabilities $p = (p_1, \ldots, p_N)'$, where $p_k \in ]0, 1]$ and $\bar{s}$ is the complement of $s$ in $U$. Under a Poisson sampling, the units are selected independently.

By means of rejective techniques, a conditional Poisson sampling of size $n$ can be implemented from the Poisson sampling. Then, the sampling design is:

$$p(s) = K^{-1} 1_{[|s| = n]} \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k),$$

where $K = \sum_{s, |s| = n} \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k)$.

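The inclusion probabilities $\pi_k = f_k(U, p, n)$ may be computed by means of a recursive method:

$$f_k(U, p, n) = n \, \frac{\frac{p_k}{1 - p_k} \left(1 - f_k(U, p, n-1)\right)}{\sum_{l \in U} \frac{p_l}{1 - p_l} \left(1 - f_l(U, p, n-1)\right)}$$

where $f_k(U, p, 0) = 0$.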

This fact was proven by Chen et al. (1994); see also Deville (2000), Matei and Tillé (2005), and Bondesson (2010). An alternative proof is given in Annex 1. It is possible that the initial $\pi_k$ of the conditional Poisson sampling design are known instead of the $p_k$'s. Chen et al. (1994) have shown that it is possible to invert the functions $f_k(U, p, n)$ by means of an algorithm which is an application of the Newton method. See also Deville (2000), who gave an enhanced algorithm.
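The recursive computation of the inclusion probabilities of a conditional Poisson sampling can be sketched as below. This is an illustrative implementation under assumptions made here: the function names are hypothetical, the recursion is written in its standard form with odds $w_k = p_k/(1 - p_k)$, and it therefore requires $p_k < 1$. A brute-force enumeration is included only as a cross-check for very small populations.

```python
from itertools import combinations

import numpy as np

def cps_inclusion_probabilities(p, n):
    """Inclusion probabilities of a conditional Poisson sampling of size n.

    Recursion: pi_k(m) = m * w_k * (1 - pi_k(m-1)) / sum_l w_l * (1 - pi_l(m-1)),
    with w_k = p_k / (1 - p_k) and pi_k(0) = 0. Assumes 0 < p_k < 1.
    """
    p = np.asarray(p, dtype=float)
    w = p / (1.0 - p)
    pi = np.zeros_like(p)
    for m in range(1, n + 1):
        num = w * (1.0 - pi)
        pi = m * num / num.sum()
    return pi

def cps_inclusion_bruteforce(p, n):
    """Same probabilities by enumerating all samples of size n (small N only)."""
    p = np.asarray(p, dtype=float)
    w = p / (1.0 - p)
    pi = np.zeros_like(p)
    total = 0.0
    for s in combinations(range(len(p)), n):
        idx = list(s)
        prob = np.prod(w[idx])   # proportional to p(s) for |s| = n
        total += prob
        pi[idx] += prob
    return pi / total
```

The recursion costs on the order of $nN$ operations, whereas the enumeration grows combinatorially; the latter is only there to check the former on toy examples.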

Assume that a posteriori, thanks to some auxiliary information, the population is stratified into $H$ strata $U_h$, $h \in [1,H]$. The size of the stratum $U_h$ is known to be equal to $N_h$, and the size of the sub-sample $s_h$ in $U_h$ is $n_h(s_0) > 0$. We consider the event $A_0 = \{s \in \mathcal{S}, \forall h \in [1,H], n_h(s) = n_h(s_0)\}$.



Proposition 2. With an initial conditional Poisson sampling of size $n$:

1. The probability conditional on the sub-sample sizes of the "a posteriori strata", $p^{A_0}(s) = p(s/A_0)$, is the probability law of a stratified sampling with (independent) conditional Poisson sampling of size $n_h(s_0)$ in each stratum.

2. The conditional inclusion probability $\pi_k^{A_0}$ of an element $k$ of the stratum $U_h$ is the inclusion probability of a conditional Poisson sampling of size $n_h(s_0)$ in a population of size $N_h$.

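Proof. 1. For a conditional Poisson sampling of fixed size $n$, a vector $(p_1, \ldots, p_N)'$ exists, where $p_k \in ]0, 1]$, such that:

$$p(s) = K^{-1} 1_{[|s| = n]} \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k),$$

where $K = \sum_{s, |s| = n} \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k)$.

We recall that $A_0 = \{s \in \mathcal{S}, \forall h \in [1,H], n_h(s) = n_h(s_0)\}$.

Then:

$$p(A_0) = K^{-1} \, \tilde{p}\Big( \bigcap_{h \in [1,H]} [n_h(s) = n_h(s_0)] \Big) = K^{-1} \prod_{h \in [1,H]} \tilde{p}([n_h(s) = n_h(s_0)]),$$

where $\tilde{p}(.)$ is the law of the original Poisson sampling. Let $s \in A_0$, then:

$$p^{A_0}(s) = \frac{p(s)}{p(A_0)} = \frac{K^{-1} \prod_{k \in s} p_k \prod_{k \in \bar{s}} (1 - p_k)}{K^{-1} \prod_{h \in [1,H]} \tilde{p}([n_h(s) = n_h(s_0)])} = \prod_{h = 1, \ldots, H} \frac{\prod_{k \in s_h} p_k \prod_{k \in \bar{s}_h} (1 - p_k)}{\tilde{p}([n_h(s) = n_h(s_0)])}$$

$$= \prod_{h = 1, \ldots, H} \frac{\prod_{k \in s_h} p_k \prod_{k \in \bar{s}_h} (1 - p_k)}{\sum_{s_h, |s_h| = n_h(s_0)} \prod_{k \in s_h} p_k \prod_{k \in \bar{s}_h} (1 - p_k)},$$

where $\bar{s}_h$ is the complement of $s_h$ in $U_h$. This is the sampling design of a stratified sampling with independent conditional Poisson sampling of size $n_h(s_0)$ in each stratum.

2. follows immediately. □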

Definition 2. In the context of conditional inference on the sub-sample sizes of a posteriori strata, under an initial conditional Poisson sampling of size $n$, the conditional HT estimator of the total $t_y$ is:

$$\hat{t}_{y,CHT} = \sum_{k \in s} \frac{y_k}{\pi_k^{A_0}}$$

The conditional variance can be estimated by means of one of the approximate variance formulae developed for the conditional Poisson sampling of size $n$. See for example Matei and Tillé (2005), or Andersson (2004).
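Proposition 2 suggests a direct way of computing the conditional inclusion probabilities: apply the recursion of Section 4.1 separately within each a posteriori stratum. The sketch below illustrates this; it is not the authors' implementation, the names `cps_pi`, `conditional_pi` and `cht_total` are hypothetical, and the strata are assumed to be coded 0, ..., H-1 with $p_k < 1$.

```python
import numpy as np

def cps_pi(p, n):
    """Inclusion probabilities of a conditional Poisson sampling of size n
    (recursion of Section 4.1), for Poisson probabilities 0 < p_k < 1."""
    p = np.asarray(p, dtype=float)
    w = p / (1.0 - p)
    pi = np.zeros_like(p)
    for m in range(1, n + 1):
        num = w * (1.0 - pi)
        pi = m * num / num.sum()
    return pi

def conditional_pi(p, strata, n_h0):
    """Conditional inclusion probabilities given the sub-sample sizes n_h0:
    a conditional Poisson sampling of size n_h0[h] inside each stratum h."""
    p = np.asarray(p, dtype=float)
    strata = np.asarray(strata)
    pi = np.empty_like(p)
    for h, n_h in enumerate(n_h0):
        idx = np.flatnonzero(strata == h)
        pi[idx] = cps_pi(p[idx], n_h)
    return pi

def cht_total(y_sample, pi_sample):
    """Conditional HT estimate: sum of y_k / pi_k over the sampled units."""
    y_sample = np.asarray(y_sample, dtype=float)
    pi_sample = np.asarray(pi_sample, dtype=float)
    return float(np.sum(y_sample / pi_sample))
```

When all the $p_k$ are equal, `conditional_pi` returns $n_h(s_0)/N_h$ in stratum $h$, so the SRS result of Proposition 1 is recovered as a special case.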



4.2 Simulations

Figure 2: Point estimation

We take the same population as in subsection 3.3. The sampling design is now a conditional Poisson sampling of size $n = 100$. The probabilities $p_k$ of the underlying Poisson design have been generated randomly, in such a way that $\sum_{k \in U} p_k = n$ and $p_k \in [0.13; 0.27]$.

$K = 10^3$ simulations were performed. Figure 2 shows that the point estimation of the mean of $y$ is globally better under conditional inference. In 77.3% of the simulations, the conditional estimator is closer to the true value than the unconditional estimator (red dots). The empirical variance is also clearly smaller for the conditional estimator.



4.3 Discussion 

This method makes it possible to compute exact conditional inclusion probabilities in an "a posteriori stratification" under conditional Poisson sampling of size n. Moreover, this method can be used for any unequal probability sampling design, provided that the sampling frame has been randomly sorted.



5 Conditioning on the Horvitz-Thompson estimator of an auxiliary variable

In the previous sections, we used the sub-sample sizes in the strata $n_h(s)$ to condition on. The good performance of this conditional approach results from the fact that the sizes of the sub-samples are important characteristics of the sample that are often used at the sampling stage. So, it is not surprising that the use of this information at the estimation stage improves the conditional estimators.

Another statistic that characterizes the representativeness of a sample is its HT estimator of the mean $\mu_x$ (or total $t_x$) of an auxiliary variable. This statistic is used at the sampling stage in balanced sampling, for example. So, like the sub-sample sizes in the strata, this statistic should produce good results in a conditional approach restricting the inference to the samples for which the HT estimates of $\mu_x$ are equal to the value $\mu_0 = \hat{\mu}_{x,HT}(s_0)$ of the selected sample $s_0$.

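In fact, we want the (conditional) set of the possible samples to be large enough so that all conditional inclusion probabilities are different from zero. It is therefore convenient to consider the set of samples that give HT estimates not necessarily strictly equal to $\mu_0$ but close to $\mu_0$. Let $\varphi = [\mu_0 - \varepsilon, \mu_0 + \varepsilon]$, for some $\varepsilon > 0$.

The set $A_\varphi$ of possible samples in our conditional approach will be:

$$A_\varphi = \{s \in \mathcal{S}, \hat{\mu}_{x,HT}(s) \in [\mu_0 - \varepsilon, \mu_0 + \varepsilon]\}.$$

The conditional inclusion probability of a unit $k$ is:

$$\pi_k^{A_\varphi} = p([k \in s] / [\hat{\mu}_{x,HT}(s) \in [\mu_0 - \varepsilon, \mu_0 + \varepsilon]]) = \frac{p(\{s \in \mathcal{S}, k \in s \text{ and } \hat{\mu}_{x,HT}(s) \in [\mu_0 - \varepsilon, \mu_0 + \varepsilon]\})}{p(\{s \in \mathcal{S}, \hat{\mu}_{x,HT}(s) \in [\mu_0 - \varepsilon, \mu_0 + \varepsilon]\})}$$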



If $\mu_0 = \mu_x$ then we are in a good configuration, because we are in a balanced sampling situation and the $\pi_k^{A_\varphi}$ will certainly stay close to the $\pi_k$.

If $\mu_0 \gg \mu_x$ say, then the sample $s_0$ is unbalanced, which means that, on average, its units have too large a contribution $x_k/\pi_k$, either because they are too big ($x_k$ large) or too heavy ($d_k = 1/\pi_k$ too large). In this case, the samples in $A_\varphi$ are also ill-balanced, because they are balanced on $\mu_0$ instead of $\mu_x$: $\frac{1}{N} \sum_{k \in s} \frac{x_k}{\pi_k} \approx \mu_0$. But conditioning on this information will improve the estimation. Indeed, the $\pi_k^{A_\varphi}$ will be different from the $\pi_k$. For example, a unit $k$ with a big contribution ($x_k/\pi_k$ large) has more chance to be in a sample of $A_\varphi$ than a unit $l$ with a small contribution. So, we can expect that $\pi_k^{A_\varphi} > \pi_k$ and $\pi_l^{A_\varphi} < \pi_l$. As a consequence, the conditional weight $d_k^{A_\varphi}$ will be lower than $d_k$ and $d_l^{A_\varphi}$ higher than $d_l$, which will "balance" the samples of $A_\varphi$.


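Discussion:

• We can use different ways to define the subset $\varphi$. One way is to use the distribution function of $\Phi(s)$, denoted $G(u)$, and to define $\varphi$ as a symmetric interval:

$$\varphi = \left[ G^{-1}\left(\max\{G(\Phi(s_0)) - \tfrac{\alpha}{2}, 0\}\right), \; G^{-1}\left(\min\{G(\Phi(s_0)) + \tfrac{\alpha}{2}, 1\}\right) \right],$$

where $\alpha = 5\%$ for example.

Hence,

$$A_\varphi = \left\{ s \in \mathcal{S}, \; \Phi(s) \in \left[ G^{-1}\left(\max\{G(\Phi(s_0)) - \tfrac{\alpha}{2}, 0\}\right), \; G^{-1}\left(\min\{G(\Phi(s_0)) + \tfrac{\alpha}{2}, 1\}\right) \right] \right\},$$

and $p(A_\varphi) \leq \alpha$.

As the cdf $G(u)$ is unknown in general, one has to replace it by an estimated cdf of $\Phi(s)$, denoted $\hat{G}_K(u)$, computed by means of simulations.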
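The construction of the interval from an estimated cdf can be sketched as follows. This is a minimal illustration, assuming that simulated values of the conditioning statistic are already available; the function name `conditioning_interval` is hypothetical, and `numpy.quantile` (with its default linear interpolation) stands in for the inverse of the estimated cdf.

```python
import numpy as np

def conditioning_interval(phi_sims, phi_s0, alpha=0.05):
    """Interval [G^-1(max(G(phi_s0) - alpha/2, 0)), G^-1(min(G(phi_s0) + alpha/2, 1))],
    where G is replaced by the empirical cdf of the simulated values phi_sims."""
    phi_sims = np.sort(np.asarray(phi_sims, dtype=float))
    # Empirical cdf evaluated at the observed statistic Phi(s_0).
    g0 = np.searchsorted(phi_sims, phi_s0, side="right") / len(phi_sims)
    lo = np.quantile(phi_sims, max(g0 - alpha / 2, 0.0))
    hi = np.quantile(phi_sims, min(g0 + alpha / 2, 1.0))
    return lo, hi
```

When the observed statistic lies in a tail of the distribution, one bound is truncated at the extreme simulated value, so the interval carries less than a share $\alpha$ of the simulated samples.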



6 Generalization: Conditional Inference Based on 
Monte Carlo simulations. 

In this section, we consider a general initial sampling design $p(s)$ with the inclusion probabilities $\pi_k$. We condition on the event $A_\varphi = \Phi^{-1}(\varphi) = \{s \in \mathcal{S}, \Phi(s) \in \varphi\}$. For example, we can use $\Phi(s) = \sum_{k \in s} \frac{x_k}{\pi_k}$, the unconditional HT estimator of $t_x$, and $\varphi = [\varphi_1, \varphi_2]$, an interval that contains $\Phi(s_0) = \sum_{k \in s_0} \frac{x_k}{\pi_k}$, the HT estimate of $t_x$ with the selected sample $s_0$. In other words, we will take into account the information that the HT estimator of the total of the auxiliary variable $x$ lies in some region $\varphi$.

The mathematical expression of $\pi_k^{A_\varphi}$ is straightforward:

$$\pi_k^{A_\varphi} = p([k \in s] / A_\varphi) = \frac{\sum_{s} p(s) 1_{[s \in A_\varphi]} 1_{[k \in s]}}{p(A_\varphi)}$$



But effective computation of the $\pi_k^{A_\varphi}$'s may not be trivial if the distribution of $\Phi$ is complex. Tillé (1998) used an asymptotic approach to solve this problem when $\Phi(s) = \sum_{k \in U} \frac{x_k}{\pi_k} 1_{[k \in s]}$; he used normal approximations for the conditional and unconditional laws of $\Phi$.

In the previous sections, we have given examples where we were able to compute the $\pi_k^{A_\varphi}$'s (and actually the $p^{A_\varphi}(s)$, $s \in \mathcal{S}$) exactly. In this section, we give a general Monte Carlo method to compute the $\pi_k^{A_\varphi}$.



6.1 Monte Carlo 

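We will use Monte Carlo simulations to estimate $E(1_{A_\varphi} 1_{[k \in s]})$ and $E(1_{A_\varphi})$. We repeat independently $K$ times the sample selection with the sampling design $p(s)$, thus obtaining a set of samples $(s_1, \ldots, s_K)$. For each simulation $i$, we compute $\Phi(s_i)$ and $1_{A_\varphi}(s_i)$. Then we compute $N + 1$ statistics:

$$M_{A_\varphi} = \sum_{i=1}^{K} 1_{A_\varphi}(s_i), \qquad \forall k \in U, \; M_k^{A_\varphi} = \sum_{i=1}^{K} 1_{A_\varphi}(s_i) 1_{[k \in s_i]}$$

We obtain a consistent estimator of $\pi_k^{A_\varphi}$, as $K \to +\infty$:

$$\hat{\pi}_k^{A_\varphi} = \frac{M_k^{A_\varphi}/K}{M_{A_\varphi}/K} = \frac{M_k^{A_\varphi}}{M_{A_\varphi}} \qquad (1)$$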
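The Monte Carlo estimator (1) translates almost literally into code. The sketch below is an illustration under assumptions made here: the function `mc_conditional_inclusion` and its arguments (`draw`, `phi`, `in_region`) are hypothetical names, and the usage example conditions on stratum counts under a SRS so that the result can be compared with the exact values $n_h(s_0)/N_h$ of Proposition 1.

```python
import numpy as np

def mc_conditional_inclusion(draw, phi, in_region, N, K):
    """Monte Carlo estimate of the conditional inclusion probabilities.

    draw: callable returning one sample (array of distinct unit indices) from p(s).
    phi: conditioning statistic Phi(s).
    in_region: callable telling whether a value of Phi lies in the region.
    Returns (pi_hat, M_A), where M_A is the number of accepted samples.
    """
    M_k = np.zeros(N)   # accepted samples containing unit k
    M_A = 0             # accepted samples
    for _ in range(K):
        s = draw()
        if in_region(phi(s)):
            M_A += 1
            M_k[s] += 1
    if M_A == 0:
        raise ValueError("no simulated sample fell in the conditioning region")
    return M_k / M_A, M_A

# Usage on a toy case: SRS of size 6 from 20 units, two strata of size 10,
# conditioning on hypothetical observed sub-sample sizes (2, 4).
rng = np.random.default_rng(0)          # arbitrary seed
N, n = 20, 6
strata = np.repeat([0, 1], 10)
target = (2, 4)
pi_hat, M_A = mc_conditional_inclusion(
    draw=lambda: rng.choice(N, size=n, replace=False),
    phi=lambda s: tuple(np.bincount(strata[s], minlength=2)),
    in_region=lambda v: v == target,
    N=N,
    K=20_000,
)
# Exact values for comparison: 2/10 in the first stratum, 4/10 in the second.
```

Only the accepted samples contribute to the estimate, so its precision is driven by $M_{A_\varphi}$ rather than by $K$.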



6.2 Point and variance estimations in conditional inference 

Definition 3. The Monte Carlo estimator of the total $t_y$ is the conditional Horvitz-Thompson estimator of $t_y$ after replacing the conditional inclusion probabilities by their Monte Carlo approximations:

$$\hat{t}_{y,MC} = \sum_{k \in s} \frac{y_k}{\hat{\pi}_k^{A_\varphi}}$$

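The Monte Carlo estimator of the variance of $\hat{t}_{y,MC}$ is:

$$\hat{V}(\hat{t}_{y,MC}) = \sum_{k,l \in s} \frac{\hat{\pi}_{k,l}^{A_\varphi} - \hat{\pi}_k^{A_\varphi} \hat{\pi}_l^{A_\varphi}}{\hat{\pi}_{k,l}^{A_\varphi} \, \hat{\pi}_k^{A_\varphi} \, \hat{\pi}_l^{A_\varphi}} \, y_k y_l$$

where

$$\hat{\pi}_{k,l}^{A_\varphi} = \frac{1}{M_{A_\varphi}} \sum_{i=1}^{K} 1_{A_\varphi}(s_i) 1_{[k \in s_i]} 1_{[l \in s_i]}$$

Fattorini (2006) established that $\hat{t}_{y,MC}$ is asymptotically unbiased as $M_{A_\varphi} \to \infty$, and that its mean squared error converges to the variance of $\hat{t}_{y,HT}$.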

Thompson and Wu (2008) studied the rate of convergence of the estimators $\hat{\pi}_k^{A_\varphi}$ and of the estimator $\hat{t}_{y,MC}$ using Chebychev's inequality. Using a normal approximation instead of Chebychev's inequality gives more precise confidence intervals. We thus have a new confidence interval for $\pi_k^{A_\varphi}$:



P 



\^-^\<F'\(l-a)/2)^^j<a, 



where $F$ is the distribution function of the normal law $\mathcal{N}(0, 1)$.
As for the relative bias, standard computation leads to:

p f \iy,CHT ~ t y ,CHT\ < \ > x _ 4>< \[M^f] 
V t y>C HT J fcp , L \ 1 + e J 



> 1 - 4n^L -^±^- ■ e'i^ ), (2) 



where tt = min{7r^, k E U}. We used the inequality 1 — F(u) < which is 

verified for large u. 



The number $K$ of simulations is set so that $\sum_{i=1}^{K} 1_{A_\varphi}(s_i)$ reaches a pre-established value $M_{A_\varphi}$. Because of our conditional framework, $K$ is a stochastic variable which follows a negative binomial distribution and we have $E(K) = \frac{M_{A_\varphi}}{p(A_\varphi)}$. For instance, if $p(A_\varphi) = 0.05 = 5\%$, with $M_{A_\varphi} = 10^6$, we expect $E(K) = 2 \cdot 10^7$ simulations.



7 Conditional Inference Based on Monte Carlo Method 
in Order to Adjust for Outlier and Strata Jumper 

We will apply the above ideas to two examples close to situations that can be found in establishment surveys: outliers and strata jumpers.

We consider an establishment survey, performed in year "n+1" and addressing year "n". The auxiliary information x, which is the turnover of the year "n", is not known at the sampling stage but is known at the estimation stage (this information may come from, say, the fiscal administration).



7.1 Outlier 

In this section, the auxiliary variable $x$ is simulated following a Gaussian law, more precisely $x_k \sim \mathcal{N}(8\,000, (2\,000)^2)$, except for unit $k = 1$, for which we assume that $x_1 = 50\,000$. The unit $k = 1$ is an outlier. The variable of interest $y$ is simulated by the linear model

$$y_k = 1000 + 0.2 \, x_k + u_k,$$

where $u_k \sim \mathcal{N}(0, (500)^2)$ and $u_k$ is independent of $x_k$. The outcomes are $\mu_x = 8\,531$ and $\mu_y = 2\,695$.

We assume that the sampling design of the establishment survey is a SRS of size $n = 20$ out of the population $U$ of size $N = 100$ and that the selected sample $s_0$ contains the unit $k = 1$. For this example, we have repeated the sample selection until the unit 1 has been selected in $s_0$.

We obtain $\Phi(s_0) = \hat{\mu}_{x,HT}(s_0) = 9\,970$, which is 17% over the true value $\mu_x = 8\,531$, and $\hat{\mu}_{y,HT}(s_0) = 3\,039$ (recall that the true value of $\mu_y$ is 2 695).
We set $\Phi$ and $\varphi$ as in Section 5 and we use Monte Carlo simulations in order to compute the conditional inclusion probabilities $\pi_k^{A_\varphi}$. Each simulation is a selection of a sample following a SRS of size $n = 20$ from the fixed population $U$. Recall that the value of $x_k$ will eventually be known for any unit $k \in U$.

Actually, we use two sets of simulations. The first set is performed in order to estimate the cdf of the statistic $\Phi(s) = \hat{\mu}_{x,HT}(s)$ which will be used to condition on. This estimated cdf will enable us to construct the interval $\varphi$. More precisely, we choose the interval $\varphi = [9\,793, 10\,110]$ by means of the estimated cdf of $\Phi(s) = \hat{\mu}_{x,HT}(s)$, so that $p([\hat{\mu}_{x,HT}(s) \in [9\,793, \hat{\mu}_{x,HT}(s_0)]]) = \frac{\alpha}{2} = 2.5\% = p([\hat{\mu}_{x,HT}(s) \in [\hat{\mu}_{x,HT}(s_0), 10\,110]])$.

$A_\varphi$ is then the set of the possible samples in our conditional approach:

$$A_\varphi = \{s \in \mathcal{S}, \hat{\mu}_{x,HT}(s) \in [9\,793, 10\,110]\}.$$

Note that $p([\hat{\mu}_{x,HT}(s) \in [9\,793, 10\,110]]) = \alpha = 5\%$. $A_\varphi$ typically contains samples that over-estimate the mean of $x$.



The second set of Monte Carlo simulations consists of $K = 10^6$ sample selections with a SRS of size $n = 20$, performed in order to estimate the conditional inclusion probabilities $\pi_k^{A_\varphi}$. 49 782 (4.98%) simulated samples fall in $A_\varphi$, and among them, 49 767 samples contain the outlier, which corresponds to the estimated conditional inclusion probability of the outlier: $\hat{\pi}_1^{A_\varphi} = 0.9997$. It means that almost all the samples of $A_\varphi$ contain the outlier, which is mainly responsible for the over-estimation because of its large value of the variable $x$.

The weight of unit 1 has changed a lot: it has decreased from $d_1 = 1/\pi_1 = 5$ to $d_1^{A_\varphi} = 1.0003$. The conditional sampling weights of the other units of $s_0$ are more comparable to their initial weights $d_k = 5$ (see Figure 7.1).

The conditional MC estimator $\hat{\mu}_{y,MC}(s) = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\hat{\pi}_k^{A_\varphi}}$ leads to a much better estimation of $\mu_y$: $\hat{\mu}_{y,MC}(s_0) = 2\,671$.
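The mechanism of this example can be reproduced qualitatively with the sketch below. It does not use the article's data: the population is a hypothetical deterministic one (99 ordinary units spread evenly over [6000, 10000] plus one outlier at 50 000), the sample $s_0$ and the half-width `eps` are chosen arbitrarily, and the interval is taken symmetric around the observed estimate instead of being derived from an estimated cdf. With these choices no sample without the outlier can reach the conditioning interval, so the conditional weight of the outlier drops to 1.

```python
import numpy as np

rng = np.random.default_rng(2024)   # arbitrary seed

N, n, K = 100, 20, 5000
# Hypothetical population: unit 0 is the outlier, the others are evenly spread.
x = np.concatenate(([50_000.0], np.linspace(6000.0, 10000.0, N - 1)))

# Hypothetical selected sample s0: the outlier plus 19 ordinary units.
s0 = np.concatenate(([0], np.arange(3, N, 5)[:19]))
mu0 = x[s0].mean()      # HT estimate of the mean of x under SRS
eps = 300.0             # hypothetical half-width of the conditioning interval

# K independent SRS draws: the n smallest of N uniform keys form a SRS.
samples = np.argsort(rng.random((K, N)), axis=1)[:, :n]
phi = x[samples].mean(axis=1)
accepted = np.abs(phi - mu0) <= eps

M_A = int(accepted.sum())                                  # accepted samples
M_k = np.bincount(samples[accepted].ravel(), minlength=N)  # accepted samples containing k
pi_hat = M_k / M_A                                         # conditional inclusion probabilities
d_hat = 1.0 / pi_hat[s0]                                   # conditional weights of the units of s0

print(f"accepted samples: {M_A}, conditional weight of the outlier: {d_hat[0]:.4f}")
```

The ordinary units of $s_0$ keep conditional weights of the same order as their initial weight of 5, which mirrors the behaviour described above.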



Figure 7.1 gives an idea of the conditional inclusion probabilities for all the units of U . 



Moreover, this graph shows that the correction of the sampling weights $d_k^{A_\varphi}/d_k$ is not a monotonic function of $x_k$, which is in sharp contrast with calibration techniques, which only use monotonic functions for weight correction purposes.



Figure 3: Outlier, Density of $\Phi(s) = \hat{\mu}_{x,HT}(s)$ (panel titles: Density of $\Phi(s)$; Sampling Weights Corrections)

A last remark concerns the distribution of the statistic $\Phi(s) = \hat{\mu}_{x,HT}(s)$. Figure 3 shows an unconditional distribution with 2 modes, far from Gaussian. This shows that in the presence of an outlier, we cannot use the method of Tillé (1999), which assumes a normal distribution for $\hat{\mu}_{x,HT}(s)$.



7.2 Strata Jumper 

In this section, the population U is divided into 2 sub-populations: the small firms and the large firms. Let us say that the size is assessed by means of the turnover of the firm. Official statistics have to be disseminated for these 2 different sub-populations. Hence, the survey statistician has to split the population into 2 strata corresponding to the sub-populations. This may not be an easy job because the size of firms can evolve from one year to another.

Here we assume that, at the time when the sample is selected, the statistician does not yet know the auxiliary information x of the turnover of the firm for the year "n", more precisely the stratum the firm belongs to for the year "n". Let us assume that he only knows this information for the previous year, "n-1". This information is denoted by z. In practice, small firms are very numerous and the sampling rate for this stratum is chosen low. On the contrary, large firms are less numerous and their sampling rate is high.

When a unit is selected among the small firms but eventually turns out to be a large 
unit in year "n", we call it a strata jumper. At the estimation stage, when the 
information $x$ becomes available, this unit will obviously be transferred to stratum 2. 
This causes a problem, due not to its $y$-value (which may well be typical of stratum 
2) but to its sampling weight, which was computed according to stratum 1 (the small firms) 
and will appear very large in comparison with those of the other units in stratum 2 at 
the estimation stage. 

In our simulations, the population $U$ is split into two strata by means of the auxiliary 
variable $z$: $U_1^z$, of size $N_1^z = 10\,000$, is the stratum of presumed small firms and $U_2^z$, 
of size $N_2^z = 100$, the stratum of presumed large firms. 

The auxiliary variable $x$, which is the turnover of year "n" known after collection, 
is simulated under a Gaussian law $\mathcal{N}(8\,000, (2\,000)^2)$ for the units of the stratum $U_2^z$ 
and for one selected unit of the stratum $U_1^z$. Let us say that this unit, the strata 
jumper, is unit 1. 

Our simulation gives $x_1 = 8\,002$. The variable of interest $y$ is simulated by the linear 
model $y_k = 1\,000 + 0.2\, x_k + u_k$, where $u_k \sim \mathcal{N}(0, (500)^2)$, with $u_k$ and $x_k$ independent. 
We do not simulate the values of $x$ and $y$ for the other units of the stratum $U_1^z$ because 
we focus on the estimation of the mean of $y$ for the sub-population of large firms 
of year "n", $U_2^x$: $\mu_{y,2} = \frac{1}{N_2^x} \sum_{k \in U_2^x} y_k$, where $N_2^x$ is the size of $U_2^x$. We find $\mu_{x,2} = 8\,138$ and $\mu_{y,2} = 2\,606$. 
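
The population model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seed and the variable names are hypothetical, and since the authors' random draws are unknown, the simulated means will be close to, but not equal to, the values 8 138 and 2 606 quoted in the text.

```python
import random

rng = random.Random(2024)  # arbitrary seed (assumption), not the authors' seed
N2 = 100                   # size of the presumed large-firm stratum

# x is drawn for the 100 presumed large firms plus the strata jumper (index 0).
x = [rng.gauss(8_000, 2_000) for _ in range(N2 + 1)]

# Linear model y_k = 1000 + 0.2 * x_k + u_k, with u_k ~ N(0, 500^2) independent of x_k.
y = [1_000 + 0.2 * xk + rng.gauss(0, 500) for xk in x]

# True means over the year-n large-firm domain (101 units in this sketch).
mu_x2 = sum(x) / len(x)
mu_y2 = sum(y) / len(y)
```

The remaining small firms are deliberately not simulated, as in the text, because they do not enter the mean of the large-firm domain.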

The sampling design of the establishments survey is a stratified SRS of size $n_1 = 400$ 
in $U_1^z$ and $n_2 = 20$ in $U_2^z$. We assume that the selected sample $s_0$ contains the unit 
$k = 1$. In practice, we repeat the sample selection until unit 1 (the strata 
jumper) has been selected. 

As previously, $\Phi$ and $\varphi$ are defined as in Section 5. 

We use Monte Carlo simulations in order to compute the conditional inclusion 
probabilities $\pi_k^{\varphi}$. A simulation is a selection of a sample with stratified SRS of size 
$n_1 = 400$ in $U_1^z$ and $n_2 = 20$ in $U_2^z$. 

We choose to condition on the statistic $\Phi(s) = \hat\mu_{x,2,HT}(s)$. $K = 10^6$ simulations 
are performed in order to estimate the cdf of $\Phi(s)$ and the conditional inclusion 
probabilities. 

Our simulations give $\Phi(s_0) = \hat\mu_{x,2,HT}(s_0) = 9\,510$, which is far from the true value 
$\mu_{x,2} = 8\,138$, and $\hat\mu_{y,2,HT}(s_0) = 3\,357$ (recall that the true value of $\mu_{y,2}$ is 2 606). 

We choose the interval $\varphi = [8\,961, 10\,342]$ by means of the estimated cdf of 
$\Phi(s) = \hat\mu_{x,2,HT}(s)$, so that $P([\hat\mu_{x,2,HT}(s) \in [8\,961, 10\,342]]) = \alpha = 5\%$. 
$A_\varphi$ is then the set of possible samples in our conditional approach: 

$$A_\varphi = \{ s \in \mathcal{S},\ \hat\mu_{x,2,HT}(s) \in [8\,961, 10\,342] \}.$$ 

All samples in $A_\varphi$ over-estimate the mean of $x$. 
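
The Monte Carlo procedure described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code. The function name, the seed and the reduced number of replicates `K` are assumptions. So, above all, is the rule used to choose the interval (a window of mass alpha of the empirical cdf, centred in rank on the observed value of the statistic), since the article defines the interval in Section 5. The x-values are freshly simulated, so the output will not reproduce the figures quoted in the text.

```python
import random

def conditional_inclusion_probs(x2, x1, phi0, K=20_000, alpha=0.05, seed=0):
    """Monte Carlo estimates of P(k in s | Phi(s) in interval).

    x2   : x-values of the units of the presumed large-firm stratum
    x1   : x-value of the strata jumper (a unit of the small-firm stratum)
    phi0 : value of the statistic Phi on the observed sample
    """
    rng = random.Random(seed)
    N1, n1 = 10_000, 400      # small-firm stratum, as in the text
    N2, n2 = len(x2), 20      # large-firm stratum
    N2x = N2 + 1              # year-n large-firm domain: stratum 2 plus the jumper

    draws = []                # (Phi(s), jumper selected?, selected stratum-2 units)
    for _ in range(K):
        s2 = rng.sample(range(N2), n2)        # SRS without replacement
        # Under SRS the jumper is selected with probability n1/N1, independently
        # of the stratum-2 draw, so the other small firms need not be simulated.
        jumper = rng.random() < n1 / N1
        total = (N1 / n1) * x1 * jumper + (N2 / n2) * sum(x2[k] for k in s2)
        draws.append((total / N2x, jumper, s2))

    # ASSUMPTION: interval = window of empirical mass alpha, centred in rank on phi0.
    phis = sorted(d[0] for d in draws)
    rank0 = sum(p <= phi0 for p in phis)
    width = max(1, round(alpha * K))
    start = min(max(rank0 - width // 2, 0), K - width)
    lo, hi = phis[start], phis[start + width - 1]

    kept = [d for d in draws if lo <= d[0] <= hi]   # replicates falling in A_phi
    pi_jumper = sum(d[1] for d in kept) / len(kept)
    counts = [0] * N2
    for _, _, s2 in kept:
        for k in s2:
            counts[k] += 1
    pi_large = [c / len(kept) for c in counts]
    return (lo, hi), len(kept) / K, pi_jumper, pi_large

rng = random.Random(1)
x2 = [rng.gauss(8_000, 2_000) for _ in range(100)]
x1 = 8_002.0
# Hypothetical observed sample: the jumper plus the first 20 units of stratum 2.
phi0 = (25 * x1 + 5 * sum(x2[:20])) / 101
interval, share, pi_jumper, pi_large = conditional_inclusion_probs(x2, x1, phi0)
```

Each estimated conditional inclusion probability is simply the share of retained replicates that contain the unit. With far fewer replicates than the $10^6$ used in the article, these estimates are noisier.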



Among the $10^6$ simulations, 49 778 simulated samples (4.98%) belong to $A_\varphi$. 55% 
of them contain the strata jumper, which gives the estimated conditional inclusion 
probability of the strata jumper: $\hat\pi_1^{\varphi} = 0.55$. It is not a surprise that the 
strata jumper is in one sample of $A_\varphi$ out of two. Indeed, its initial sampling weight 
$d_1 = \frac{10\,000}{400} = 25$ is high in comparison with the weights $d_k = \frac{100}{20} = 5$ of the other 
selected units of the stratum $U_2^x$, and its contribution leads to over-estimating 
the mean of $x$. 
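
As a quick check of the orders of magnitude, and assuming (as the article proposes) that the conditional weight is the inverse of the estimated conditional inclusion probability, the weight of the strata jumper becomes:

$$\hat d_1^{\varphi} = \frac{1}{\hat\pi_1^{\varphi}} = \frac{1}{0.55} \approx 1.82, \qquad \text{compared with the initial weight } \frac{10\,000}{400} = 25.$$

The conditioning therefore divides the weight of unit 1 by a factor of about 14.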



The conditional inclusion probabilities for the other units of $U_2^x$ are comparable to 
their initial $\pi_k = 0.2$ (see Figure 4). 



The conditional MC estimator $\hat\mu_{y,2,MC}(s) = \frac{1}{N_2^x} \sum_{k \in s \cap U_2^x} \frac{y_k}{\hat\pi_k^{\varphi}}$ leads to a better estimation 
of $\mu_{y,2}$: $\hat\mu_{y,2,MC}(s_0) = 2\,649$. 
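
The estimator itself is a plain weighted mean; only the inclusion probabilities change between the Horvitz-Thompson and the conditional versions. A minimal Python sketch, in which the function name and the toy numbers are hypothetical and serve only to show the mechanics:

```python
def weighted_mean_estimate(y_sample, incl_probs, domain_size):
    """Estimate a domain mean as (1/N) * sum of y_k / pi_k over the sampled units."""
    if len(y_sample) != len(incl_probs):
        raise ValueError("one inclusion probability is needed per sampled unit")
    return sum(y / p for y, p in zip(y_sample, incl_probs)) / domain_size

# Toy data (not from the article): two sampled units in a domain of size 100.
y_toy = [10.0, 20.0]
ht = weighted_mean_estimate(y_toy, [0.5, 0.25], domain_size=100)  # design probabilities
mc = weighted_mean_estimate(y_toy, [0.5, 0.80], domain_size=100)  # conditional probabilities
```

Raising the inclusion probability of the second unit from 0.25 to 0.80 lowers its weight, and hence its contribution to the estimate, which is the mechanism at work for the strata jumper.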



Figure 4: Strata Jumper, Sampling Weight Corrections 



Figure 4 shows that the sampling weight correction is here a non-monotonic function 
of the variable $x$. We point out that the usual calibration method would not be able 
to perform this kind of weight correction, because the calibration function used to 
correct the weights should be monotonic. 

As in the outlier setting, the unconditional distribution of the statistic 
$\Phi(s) = \hat\mu_{x,2,HT}(s)$ has two modes and is far from Gaussian. 

8 Conclusion 



At the estimation stage, new auxiliary information can reveal that the selected 
sample is imbalanced. We have shown that a conditional inference approach can 
take this information into account and leads to a more precise estimator than the 
unconditional Horvitz-Thompson estimator, in the sense that the conditional estimator 
is unbiased (conditionally and unconditionally) and that the conditional variance 
is a more rigorous measure of precision a posteriori. 
In practice, we recommend using Monte Carlo simulations to estimate the 
conditional inclusion probabilities. 

This technique seems particularly well suited to the treatment of outliers and 
strata jumpers. 






A Annex 1: Inclusion Probability with Conditional 
Poisson Sampling 



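Proof. The event $\left[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\right]$ is independent of the events $[I_{[k \in s]} = 0]$ 
and $[I_{[k \in s]} = 1]$ in the Poisson model. So we can write: 

\begin{align}
P\Big(\Big[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\Big]\Big)
 &= P\Big(\Big[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\Big] \,/\, [I_{[k \in s]} = 0]\Big) \tag{3} \\
 &= P\Big(\Big[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\Big] \,/\, [I_{[k \in s]} = 1]\Big) \tag{4}
\end{align}

Equation (3) gives: 

\begin{align*}
P\Big(\Big[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\Big] \,/\, [I_{[k \in s]} = 0]\Big)
 &= P\Big(\Big[\sum_{l \in U} I_{[l \in s]} = n-1\Big] \,/\, [I_{[k \in s]} = 0]\Big) \\
 &= \frac{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n-1\big]\big)\, P\big([I_{[k \in s]} = 0] \,/\, \big[\sum_{l \in U} I_{[l \in s]} = n-1\big]\big)}{P([I_{[k \in s]} = 0])} \\
 &= \frac{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n-1\big]\big)\, \big(1 - f_k(U, p, n-1)\big)}{1 - p_k}
\end{align*}

and equation (4) gives: 

\begin{align*}
P\Big(\Big[\sum_{l \in U,\, l \neq k} I_{[l \in s]} = n-1\Big] \,/\, [I_{[k \in s]} = 1]\Big)
 &= P\Big(\Big[\sum_{l \in U} I_{[l \in s]} = n\Big] \,/\, [I_{[k \in s]} = 1]\Big) \\
 &= \frac{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n\big]\big)\, P\big([I_{[k \in s]} = 1] \,/\, \big[\sum_{l \in U} I_{[l \in s]} = n\big]\big)}{P([I_{[k \in s]} = 1])} \\
 &= \frac{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n\big]\big)\, f_k(U, p, n)}{p_k}
\end{align*}

So we have: 

\begin{align*}
f_k(U, p, n) &= \big(1 - f_k(U, p, n-1)\big)\, \frac{p_k}{1 - p_k}\, \frac{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n-1\big]\big)}{P\big(\big[\sum_{l \in U} I_{[l \in s]} = n\big]\big)} \\
 &= \big(1 - f_k(U, p, n-1)\big)\, \frac{p_k}{1 - p_k}\, h(U, p, n)
\end{align*}

And we can use the property $\sum_{k \in U} f_k(U, p, n) = n$ to compute $h(U, p, n)$ 
and conclude. □ 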
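
The recursion obtained in the proof lends itself to a direct implementation. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the recursion is started from $f_k(U, p, 0) = 0$ (an empty sample contains no unit), and $h(U, p, m)$ is obtained from the constraint that the inclusion probabilities sum to $m$. A brute-force enumeration over all samples of size $n$ is included only as a check for small populations.

```python
from itertools import combinations
from math import prod

def cps_inclusion_probs(p, n):
    """Inclusion probabilities f_k(U, p, n) of conditional Poisson sampling.

    p : Poisson selection probabilities, each strictly between 0 and 1
    n : fixed sample size on which the Poisson design is conditioned
    """
    odds = [pk / (1.0 - pk) for pk in p]
    f = [0.0] * len(p)                       # f_k(U, p, 0) = 0
    for m in range(1, n + 1):
        g = [o * (1.0 - fk) for o, fk in zip(odds, f)]
        total = sum(g)                       # h(U, p, m) = m / total
        f = [m * gk / total for gk in g]
    return f

def cps_brute_force(p, n):
    """Same probabilities by enumerating every sample of size n (small N only)."""
    odds = [pk / (1.0 - pk) for pk in p]
    num = [0.0] * len(p)
    total = 0.0
    for s in combinations(range(len(p)), n):
        w = prod(odds[k] for k in s)         # P(s) is proportional to this product
        total += w
        for k in s:
            num[k] += w
    return [v / total for v in num]

p_toy = [0.1, 0.3, 0.5, 0.7, 0.2]            # toy probabilities, not from the article
f_rec = cps_inclusion_probs(p_toy, 2)
f_enum = cps_brute_force(p_toy, 2)
```

The recursive version costs on the order of $nN$ operations, whereas the enumeration grows combinatorially with $N$.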